5 research outputs found

    Causally Regularized Learning with Agnostic Data Selection Bias

    Most previous machine learning algorithms are based on the i.i.d. hypothesis. However, this ideal assumption is often violated in real applications, where selection bias may arise between the training and testing processes. Moreover, in many scenarios the testing data is not even available during training, which makes traditional methods like transfer learning infeasible because they require prior knowledge of the test distribution. Addressing agnostic selection bias for robust model learning is therefore of paramount importance for both academic research and real applications. In this paper, under the assumption that causal relationships among variables are robust across domains, we incorporate causal techniques into predictive modeling and propose a novel Causally Regularized Logistic Regression (CRLR) algorithm that jointly optimizes global confounder balancing and weighted logistic regression. Global confounder balancing helps to identify causal features, whose causal effects on the outcome are stable across domains; performing logistic regression on those causal features then yields a predictive model that is robust to the agnostic bias. To validate the effectiveness of CRLR, we conduct comprehensive experiments on both synthetic and real-world datasets. Experimental results clearly demonstrate that CRLR outperforms state-of-the-art methods, and the interpretability of our method can be fully depicted by feature visualization. (Oral paper at the 2018 ACM Multimedia Conference, MM'18.)
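    The joint objective the abstract describes can be sketched roughly as a weighted logistic loss plus a confounder-balancing penalty. This is an illustrative reading, not the paper's implementation: binary features are assumed, and `lam` is a hypothetical trade-off parameter.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def balance_loss(X, w):
        """Global confounder balancing: treating each binary feature j in
        turn as the 'treatment', penalize the weighted mean difference of
        the remaining features between the j=1 and j=0 groups."""
        loss = 0.0
        n, p = X.shape
        for j in range(p):
            t = X[:, j] == 1
            if t.all() or (~t).all():
                continue  # no contrast for this feature
            rest = np.delete(X, j, axis=1)
            wt, wc = w[t], w[~t]
            diff = (wt @ rest[t]) / wt.sum() - (wc @ rest[~t]) / wc.sum()
            loss += diff @ diff
        return loss

    def crlr_objective(beta, w, X, y, lam=1.0):
        """Sample-weighted logistic loss + lam * balancing penalty; the full
        method would alternate updates of beta and the sample weights w."""
        p = sigmoid(X @ beta)
        eps = 1e-12
        log_loss = -np.sum(w * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        return log_loss + lam * balance_loss(X, w)
    ```

    With perfectly balanced data the penalty vanishes, so the objective reduces to ordinary weighted logistic regression.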

    Removing Confounds in Text Classification for Computational Social Science

    Nowadays, one can use social media and other online platforms to communicate with friends and family, write a review for a product, ask questions about a topic of interest, or even share details of one's private life with the rest of the world. The ever-increasing amount of user-generated content has provided researchers with data that can offer insights into human behavior. As a result, the field of computational social science, at the intersection of machine learning and the social sciences, has soared in recent years, especially within public health research. However, working with large amounts of user-generated data creates new issues. In this thesis, we propose solutions for two problems encountered in computational social science, both related to confounding bias.

    First, because of the anonymity that online forums, social networks, and other blogging platforms provide through the common use of usernames, it is hard to obtain accurate information about users such as gender, age, or ethnicity. Therefore, although collecting data on a specific topic is made easier, conducting an observational study with this type of data is not simple. When one wishes to measure the effect of one variable on another, one needs to control for potential confounding variables. In the case of user-generated data, these potential confounders are at best noisily observed or inferred, and at worst not observed at all. In this work, we show how to use these inferred latent attributes to conduct an observational study while reducing the effect of confounding bias as much as possible. We first present a simple matching method in a large-scale observational study. Then, we propose a method to retrieve relevant and representative documents through adaptive query building in order to construct the treatment and control groups of an observational study.

    Second, we focus on the problem of controlling for confounding variables when the influence of these variables on the target variable of a classification problem changes over time. Although identifying and controlling for confounding variables has been assiduously studied in empirical social science, it is often neglected in text classification. This is understandable: if we assume that the impact of confounding variables does not change between the training and the testing data, then prediction accuracy should only be slightly affected. Yet this assumption often does not hold when working with user-generated text. Because of this, computational social science studies risk reaching false conclusions when based on text classifiers that do not control for confounding variables. In this document, we propose to build a classifier that is robust to confounding-bias shift, and we show that we can build such a classifier in different situations: when there are one or more observed confounding variables, when there is one noisily predicted confounding variable, or when the confounding variable is unknown but can be detected through topic modeling.
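    One standard way to make a classifier robust to a shift in a confounder's influence is backdoor-style adjustment: include the confounder as a feature at training time, then marginalize over it at prediction time using a target-domain prior. The following is a minimal numpy sketch under that assumption, not necessarily the thesis's exact formulation; all function names are illustrative.

    ```python
    import numpy as np

    def fit_logreg(X, y, lr=0.1, steps=2000):
        """Plain logistic regression via gradient descent (intercept appended).
        X should include the confounder as its last column."""
        Xb = np.hstack([X, np.ones((len(X), 1))])
        beta = np.zeros(Xb.shape[1])
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-Xb @ beta))
            beta -= lr * Xb.T @ (p - y) / len(y)
        return beta

    def predict_adjusted(beta, X_text, z_values, z_prior):
        """P(y|x) = sum_z P(y|x,z) P(z): average the conditional predictions
        over the confounder's values, weighted by a chosen prior, so a
        train-time shift in P(z|y) does not carry into prediction."""
        out = np.zeros(len(X_text))
        ones = np.ones((len(X_text), 1))
        for z, pz in zip(z_values, z_prior):
            Xz = np.hstack([X_text, np.full((len(X_text), 1), z), ones])
            out += pz / (1.0 + np.exp(-Xz @ beta))
        return out
    ```

    The design choice here is to push the confounder out of the prediction rule by averaging over it, rather than dropping it from training, which would let the text features absorb its effect.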

    Using Matched Samples to Estimate the Effects of Exercise on Mental Health via Twitter

    Recent work has demonstrated the value of social media monitoring for health surveillance (e.g., tracking influenza or depression rates). It is an open question whether such data can be used to make causal inferences (e.g., determining which activities lead to increased depression rates). Even in traditional, restricted domains, estimating causal effects from observational data is highly susceptible to confounding bias. In this work, we estimate the effect of exercise on mental health from Twitter, relying on statistical matching methods to reduce confounding bias. We train a text classifier to estimate the volume of a user's tweets expressing anxiety, depression, or anger, then compare two groups: those who exercise regularly (identified by their use of physical activity trackers like Nike+) and a matched control group. We find that those who exercise regularly have significantly fewer tweets expressing depression or anxiety; there is no significant difference in rates of tweets expressing anger. We additionally perform a sensitivity analysis to investigate how the many experimental design choices in such a study impact the final conclusions, including the quality of the classifier and the construction of the control group.
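    The matching step in a study like this can be sketched as greedy one-to-one nearest-neighbor matching on user covariates, followed by the average treated-minus-matched-control difference in outcomes. This is a generic illustration of statistical matching, not the paper's exact procedure; the function names are invented for the sketch.

    ```python
    import numpy as np

    def greedy_match(treated_cov, control_cov):
        """Greedy 1-to-1 matching: each treated unit gets the closest
        not-yet-used control unit (Euclidean distance on covariates)."""
        used = set()
        pairs = []
        for i, tc in enumerate(treated_cov):
            dists = np.linalg.norm(control_cov - tc, axis=1)
            for j in np.argsort(dists):
                if int(j) not in used:
                    used.add(int(j))
                    pairs.append((i, int(j)))
                    break
        return pairs

    def matched_effect(y_treated, y_control, pairs):
        """Average outcome difference over matched pairs (an ATT-style
        estimate of the effect of treatment on the treated)."""
        return float(np.mean([y_treated[i] - y_control[j] for i, j in pairs]))
    ```

    Greedy matching is order-dependent and can degrade when good controls are scarce, which is one reason such studies pair it with a sensitivity analysis.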